Method and Apparatus for Communications Analysis

ABSTRACT

A method of grouping communication sessions, the method comprising: selecting a plurality of communications sessions from a data stream; determining which data structures, of said communication sessions, occur more frequently than chance; and sorting the communication sessions into groups, wherein communication sessions which have similar data structures, determined to occur more frequently than chance, are sorted into the same group.

CROSS REFERENCE TO RELATED APPLICATIONS

This application is a divisional of co-pending U.S. application Ser. No. 13/365,760, filed Feb. 3, 2013, which claims priority to United Kingdom Application GB 1103492.3, filed Mar. 1, 2011, and United Kingdom Application GB 1101875.1, filed Feb. 3, 2011. Each of these prior applications is hereby incorporated by reference in its entirety.

FIELD

The present invention relates to a method and apparatus for communications analysis. In particular, it relates to a method and apparatus for determining communications sessions having the same protocol structure.

BACKGROUND TO THE INVENTION

It is possible to extract information from a data stream with knowledge of the communications protocols being used to send data. There is a need to be able to establish when communication sessions have similar structure which may be indicative of an unknown protocol.

SUMMARY OF THE INVENTION

In a first aspect, the present invention provides a method of grouping communication sessions, the method comprising: selecting a plurality of communications sessions from a data stream; determining which data structures, of said communication sessions, occur more frequently than chance; and sorting the communication sessions into groups, wherein communication sessions which have similar data structures, determined to occur more frequently than chance, are sorted into the same group.

In a second aspect, the present invention provides a method of grouping communications sessions, the method comprising: extracting a plurality of communication sessions from a data stream, each communication session comprising a sequence of characters; analysing the communication sessions to determine sequences of characters which exhibit repeatable behaviour; and sorting communications sessions having similar sequences of characters into groups.

In a third aspect, the present invention provides a method of configuring a sensor to extract data from a communication stream, using a group of communication sessions representing a particular communications protocol, the group comprising data structures representative of that protocol, the method comprising: generating a plurality of records representing said data structures, each record having a particular pattern; grouping said records based on the similarity of said patterns, such that each group includes records having the same pattern; generating a template based on the pattern of each group; and configuring said sensor using said template.

Further features of the invention are defined in the appended dependent claims.

BRIEF DESCRIPTION OF THE DRAWINGS

By way of example only, the present invention will now be described with reference to the drawings, in which:

FIG. 1 is a flow diagram showing the operation of the present invention in a first embodiment;

FIG. 2 shows a computer network in accordance with an embodiment of the present invention;

FIG. 3 shows a system in accordance with an embodiment of the present invention;

FIG. 4 shows a flow diagram showing the operation of the present invention in a further embodiment;

FIG. 5 is a histogram showing a plot of session frequency in the absence of any protocols;

FIG. 6 is a histogram showing a plot of session frequency in the presence of communication protocols;

FIG. 7 is a flow diagram showing the operation of the present invention in a further embodiment;

FIG. 8 shows a system in accordance with an embodiment of the present invention; and

FIG. 9 is a flow diagram showing the operation of the present invention in a further embodiment.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

The first embodiment includes an apparatus and method for determining, from a raw data stream, communication sessions which have a common structure. Common structure in communication sessions may be taken to be indicative of use of a common communication protocol. Hence, using this method, it is possible to establish that unrelated communication sessions utilise common, but unknown, communication protocols. In the context of this description, a communication session is a unidirectional stream of data that is travelling from a single source to a single destination. It is therefore possible, using this method, to determine that communication protocols exist, without prior knowledge of those protocols.

FIG. 1 is a flow diagram showing the method of this embodiment. A plurality of communication sessions, contained within a raw data stream, are extracted (block 100).

These communication sessions are then analysed to extract data structures which occur more frequently than would be expected by chance (block 101). Finally, communication sessions having similar extracted data structures are clustered together (block 102). Those communications sessions which have similar extracted data structures may be considered to be utilising the same communication protocols. The output of this process is at least one group of communication sessions considered to use the same communication protocol. Further details of how each of these steps is undertaken are provided below.

FIG. 2 shows a computer network 110 which includes several computer terminals. These computer terminals are referred to as endpoints. The Figure shows endpoints A 111, B 112, C 113, D 114, E 115 and F 116. The network 110 also includes a node 117. Data may be sent between the endpoints via node 117. This network is shown as an example of the kind of network which the present method may be used with in order to extract communication sessions. The network may be the Internet, for example. A simple network is shown here in order to demonstrate the principles of operation of the method. It will be appreciated that the network may be more complex than shown, as would be the case for the Internet.

Data may be sent between the endpoints. Typically data would be sent in the form of a series of data packets from one endpoint to another. For example, the data may be sent in accordance with TCP/IP. For the purposes of this example, data is sent across network 110 using TCP/IP. Data is routed via node 117. In this respect, node 117 acts as a router. In practice, a network may contain many hundreds of nodes. For the purposes of explaining the present method, only one is required. The various endpoints all communicate with each other using one or more protocols (sub-protocols of the TCP/IP network protocols).

Further details of the components of the apparatus used to carry out the method will now be described. In this example, the apparatus is located within node 117. For the purposes of this example, the apparatus shall be referred to as a common data structure determination system 120. The system 120 is shown in FIG. 3. The system 120 includes the various components which are required to carry out the method. It will be appreciated that in practice, some of these components may be combined, or alternatively, that the functionality of some components is provided by two or more further components. It will also be appreciated that the components may be provided in hardware or software, the actual implementation not being relevant to the function of the method. FIG. 4 is a flow diagram showing the operation of the system 120.

The system includes a sampler 121. The sampler 121 is used to extract communication sessions from the raw data stream flowing through the node 117. The process of extracting a plurality of communications sessions is represented by block 200 in FIG. 4. The sampler 121 takes a sample of TCP/IP packets from the raw data stream (referred to hereinafter as the “bearer”). The sampler 121 randomly selects a packet. It then looks at the address information in that packet (IP/TCP/UDP) and then further selects all packets in the same session.

The sampler 121 may select the initial packet used to select the subsequent session data in a number of ways. For example, the sampler 121 may randomly select packets from the bearer. This may be done by selecting every nth packet from the bearer. Alternatively, this may be done by searching for a particular sequence of characters in the TCP sequence number field or by searching for a randomly generated pattern in the packet payload. Rather than randomly selecting packets, the sampler 121 may select all packets containing a particular data type; for example, HTTP or certain types of compressed data. As a further alternative, packets may be extracted by searching for randomly selected addresses in the Network and Transport Layer protocols. Regardless of the process chosen, the sampler extracts a large number of packets from a number of communication sessions.

Once the sampler 121 has extracted enough packets, the packets must be sorted into respective communication sessions. In other words, the packets are sorted into unidirectional streams of data between two endpoints, each endpoint being identified by an IP address. Such a stream is a communication session. This is achieved by sorting the packets into sets according to: IP source address and IP destination address; IP source address, IP destination address, TCP source port number and TCP destination port number; IP source address, IP destination address, UDP source port number and UDP destination port number; or permutations thereof.

For TCP, each set of packets is then put in a queue in TCP sequence number order and duplicated TCP data is removed. For any sets of packets that are carrying HTTP protocol data, the HTTP headers are analysed and the associated data encodings are determined. If required, the HTTP data payloads are decoded, so that the original, un-encoded data is recovered. A similar technique may be applied to UDP packets. Following the above process, reconstructed, un-encoded data streams are recovered. These are the communication sessions. For a typical analysis, several hundred megabytes may be sampled, resulting in several thousand sessions.
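By way of illustration only, the sorting and reassembly described above may be sketched as follows. The `Packet` type and its field names are hypothetical, and the sketch assumes that sequence numbers do not wrap around and that no HTTP decoding is needed; it is not the sampler's actual implementation.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    """Hypothetical, already-parsed TCP packet (field names are assumptions)."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    seq: int          # sequence number of the first payload byte
    payload: bytes


def reassemble_sessions(packets):
    """Group packets into unidirectional sessions and rebuild each byte stream."""
    groups = defaultdict(list)
    for p in packets:
        # Source and destination are not swapped, so each direction is its own session.
        groups[(p.src_ip, p.dst_ip, p.src_port, p.dst_port)].append(p)

    sessions = {}
    for key, pkts in groups.items():
        pkts.sort(key=lambda p: p.seq)      # queue in sequence-number order
        stream = bytearray()
        next_seq = pkts[0].seq
        for p in pkts:
            end = p.seq + len(p.payload)
            if end <= next_seq:
                continue                    # wholly duplicated data is dropped
            # Keep only the bytes not already seen (handles partial overlap).
            stream += p.payload[max(0, next_seq - p.seq):]
            next_seq = end
        sessions[key] = bytes(stream)
    return sessions
```

Keying on the ordered address and port tuple is what makes each session unidirectional: the reply traffic falls into a separate set.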

The system 120 further includes a data extractor 122. The purpose of the data extractor 122 is to locate strings of data which may be representative of protocol structure. In order to do this, the extractor 122 searches for entities located within each communication session. The idea behind this is that a message sent between two entities typically includes an identifier. For the purposes of this description, we shall call the identifier an entity. For example, the entity may be a real name, such as John or Sarah. Alternatively, the entity may be an email address, a username, a numeric identifier, a random string of characters, a pre-defined string of characters, or a media filename. In general, a protocol will contain data structures which define the operations of that protocol. For messaging protocols there will be data structures that contain addressee information. The addressee information is information designed into the protocol that is used to identify logical entities within that protocol, such as a user. Thus, for messaging protocols one might expect an entity to appear in close proximity to these protocol data structures. Therefore, if we can locate an entity this provides a means of identifying a potential protocol and of estimating where the data structures containing the addressee information might be found within a session carrying said protocol.

The data extractor 122 includes an entity store 123 which stores entities used as the basis for searches of the communications sessions. The data extractor 122 also contains a number of bespoke entity identifier methods. These methods include an email address identification method, a username identification method, a real name identification method, a numeric identifier identification method and a generalised search method. In the following, only the method utilising the generalised search approach is described. However, any of the above methods may be used in isolation or combination to provide the raw triple records described subsequently. In the context of this example, an entity is simply a string of characters which the data extractor 122 must search for in the communication sessions. In the present case, the entity store 123 includes a number of “real” names, one of which is the name “Neil”. The system 120 will therefore attempt to locate data in the communications streams which includes the name “Neil” and which may therefore relate to a message sent using a particular protocol.

The data extractor 122 searches through all of the communication sessions for the name “Neil” (block 201). Any communication sessions which include zero or one instance of the name “Neil” are excluded from further analysis. If the communication session includes two or more instances of the name “Neil”, then it is used for further analysis.

When the data extractor 122 locates the name “Neil” it extracts the entity from the communication session, together with data in the immediate vicinity of the entity (block 202). As noted above, the data in the vicinity of an entity may be expected to include the structure of the protocol used to send any message associated with the entity. The data extractor 122 extracts a fore-string and an aft-string. The fore-string is the set of characters immediately before the entity, and the aft-string is the set of characters immediately after the entity. The data extractor 122 therefore produces a triple associated with the entity (fore-string, entity, aft-string). The fore-string and aft-string are referred to as the entity's context.

In this case, the data extractor 122 locates all triples, across all communication sessions, including the name “Neil”. In this case, the fore-string and aft-string are chosen to be 12 characters each, in order that the principle of operation may be clearly shown. However, in practice the fore-string and aft-string may be any length. 128 characters has been found to be particularly suitable. One example of a triple may be:

-   -   123From_456:Neil;<To:>123456

Each triple is then associated with the communication session from which it came. Following this process, it can be expected that a large number of triples include contexts which include protocol structure. However, some of the triples may contain no protocol structure. For example, if the name “Neil” is located in the middle of some message text, the context may well only be other parts of the body of the message. In the next stage, the system must differentiate between contexts with protocol structure, and contexts without such structure.
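The extraction of triples described above may be sketched, purely as an illustration, as follows. The function names are hypothetical and the sessions are assumed to be already decoded into text; the 12-character width matches the worked example rather than the 128 characters suggested for practical use.

```python
def extract_triples(session, entity, width=12):
    """Return (fore-string, entity, aft-string) for each occurrence of entity."""
    triples = []
    start = session.find(entity)
    while start != -1:
        end = start + len(entity)
        fore = session[max(0, start - width):start]   # characters just before
        aft = session[end:end + width]                # characters just after
        triples.append((fore, entity, aft))
        start = session.find(entity, end)
    return triples


def triples_by_session(sessions, entity, width=12):
    """Keep only sessions containing two or more instances of the entity."""
    kept = {}
    for session_id, text in sessions.items():
        triples = extract_triples(text, entity, width)
        if len(triples) >= 2:
            kept[session_id] = triples
    return kept
```

Storing the triples under their session identifier preserves the association between each triple and the session it came from.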

The system 120 includes a context processor 124. The context processor is responsible for processing all of the triples extracted by data extractor 122 in order to determine which contexts are associated with protocol structure. The context processor operates on the principle that protocol structure is likely to repeat itself across a number of contexts. Therefore, there is a requirement to distinguish between contexts which exhibit similarities with other contexts, and those that do not.

The context processor 124 is arranged to generate a plurality of ngrams from the context of each entity (block 203). An ngram is a sequence of n characters taken from the context. The context processor 124 is arranged to generate ngrams that overlap by n−1. In this example, n=4. However, n may be any number less than the length of the fore-string and aft-string. Ideally, n should be a low number, relative to the context length. Using the above example, the ngram sets would be as follows:

fore-string set: 123F, 23Fr, 3Fro, From, rom_, om_4, m_45, _456, 456:

aft-string set: ;<To, <To:, To:>, o:>1, :>12, >123, 1234, 2345, 3456

For each communication session, all of the ngrams are formed into a set which represents that session. Accordingly, a large number of sets of ngrams are produced, each set being associated with a particular communication session.
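As a minimal sketch of the ngram generation just described, assuming contexts held as Python strings and using hypothetical function names:

```python
def ngrams(context, n=4):
    """All length-n substrings of context, each overlapping the next by n-1."""
    return [context[i:i + n] for i in range(len(context) - n + 1)]


def session_ngram_set(triples, n=4):
    """Form the ngrams from every context in a session into a single set."""
    grams = set()
    for fore, _entity, aft in triples:
        grams.update(ngrams(fore, n))
        grams.update(ngrams(aft, n))
    return grams
```

Applied to the 12-character fore-string and aft-string of the example triple, `ngrams` yields the two nine-element sets listed above.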

As noted above, the system 120 needs to establish which ngrams are likely to be part of a protocol structure, and which ngrams are not likely to be part of protocol structure. Protocols, by their design, consist of fixed syntax blocks carrying fixed or variable data. For communications traffic, those ngrams which form part of a protocol structure may be expected to occur more frequently than those that do not. For all ngrams across all communication session sets, the context processor 124 determines the session frequency for each ngram. This is simply the number of sessions in which the ngram occurs. The context processor 124 generates a histogram of the session frequencies.

FIG. 5 is a histogram which shows the expected plot where the distribution of ngrams is random, i.e. where no communications protocols are present. A large number of ngrams with low session frequency would be expected, with smoothly decreasing numbers as the session frequency increases. At a certain value of session frequency, the expected number of ngrams drops to zero. In FIG. 5, C represents the maximum expected observed session frequency. A typical value of C will be between 20 and 30.

When a communication protocol is present, non-randomness will be expected in the distribution. This gives rise to two features, as shown in FIG. 6. Firstly, there will be significant departure from the smooth decrease. Secondly, session frequencies significantly above C are observed. These features are labelled as [1] and [2] respectively in FIG. 6. The ngrams which give rise to these anomalies are labelled as “interesting” ngrams. These ngrams are those which are expected to relate to part of a protocol structure. If, following this process, zero or very few “interesting” ngrams are located, the process terminates without producing any outputs (block 204).

Now that the interesting ngrams have been identified, each session is represented by a set of those interesting ngrams. The system 120 also includes a session cluster processor 125. The session cluster processor 125 is arranged to group communication sessions which include similar ngrams, and which may therefore be assumed to include the same communications structure.
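The session-frequency calculation may be sketched as follows. This is a simplified illustration with hypothetical function names: it selects only those ngrams whose session frequency exceeds a supplied threshold `c` (feature [2]), and does not attempt to detect departures from the smooth decrease (feature [1]).

```python
from collections import Counter


def session_frequencies(session_ngrams):
    """Number of sessions in which each ngram occurs (not total occurrences)."""
    freq = Counter()
    for grams in session_ngrams.values():
        freq.update(set(grams))          # count each ngram once per session
    return freq


def frequency_histogram(freq):
    """Map each session frequency to the number of ngrams having it."""
    return Counter(freq.values())


def interesting_ngrams(freq, c):
    """Ngrams seen in more sessions than the chance maximum c (assumed given)."""
    return {gram for gram, count in freq.items() if count > c}
```

An empty or very small result from `interesting_ngrams` corresponds to the case in which the process terminates without output.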

The session cluster processor 125 contains a vector processor 126. The vector processor 126 is arranged to allow the similarity of different sessions to be measured. To achieve this, the set of ngrams associated with each session are represented as a vector and vector analysis is used to establish how similar the sessions' ngrams are to each other. The vector processor 126 is arranged to generate a vector to represent each session (block 205). Each interesting ngram in a session is designated a separate dimension of a vector. For example, using the fore-string noted above:

-   -   123F=i
    -   23Fr=j
    -   3Fro=k
    -   etc

The session can then be represented by a vector V:

V=i+j+k+l+m+n+o+p+q

Those ngrams which occur with a higher frequency will result in a larger vector component. Each session is represented by its own vector. Accordingly, following vector processing, the cluster processor 125 holds a large number of vectors, each representing a session.

In order to determine which sessions are likely to include similar protocols, a distance measure is used. For example, a cosine similarity measure may be used to determine the angle between each pair of vectors. For each session in the collection the vector processor 126 calculates the distance between said session and each other session in the collection. These distances are then stored.
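One possible form of the vector construction and distance calculation is sketched below. The sparse `Counter` representation, the function names, and the choice of one minus the cosine similarity as the distance are assumptions made for illustration.

```python
import math
from collections import Counter


def session_vector(ngram_occurrences, interesting):
    """Sparse vector: one dimension per interesting ngram, weighted by count."""
    return Counter(g for g in ngram_occurrences if g in interesting)


def cosine_similarity(u, v):
    dot = sum(u[g] * v[g] for g in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0                       # an empty vector is similar to nothing
    return dot / (norm_u * norm_v)


def pairwise_distances(vectors):
    """Distance (1 - cosine similarity) between every pair of sessions."""
    ids = list(vectors)
    return {
        (a, b): 1.0 - cosine_similarity(vectors[a], vectors[b])
        for i, a in enumerate(ids)
        for b in ids[i + 1:]
    }
```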

The set of distances and references to the sessions to which they belong are then provided to the cluster processor 125. The cluster processor then clusters (block 207) the sessions by using the distance between the sessions as a clustering metric. This establishes which sessions have similar properties. For example, an algorithm such as the ‘Density-Based Spatial Clustering of Applications with Noise’ (DBSCAN) may be used. An advantage of this algorithm is that it is fast and can locate arbitrarily-shaped clusters. When applying this algorithm in the present context, clusters range in size from a few to a few hundred sessions.

Following the clustering operation, each cluster is considered to include only sessions which use the same underlying communications protocols. The cluster processor does not determine what the protocol is, rather it determines the fact that a particular group of sessions have common structure which, with a high degree of certainty, can be assumed to represent a particular protocol.

The information relating to ngrams in each cluster may then be stored for further analysis. This may be in the form of human intervention, to visually inspect the ngrams to establish what protocols are being used. Alternatively, the interesting ngrams may be used to program a sensor to detect data in the raw data stream which contains those ngrams. This allows for the extraction of further sessions which contain protocol structure which is the same as that identified by the above process. This allows the identified protocols to be filtered out of the data stream without needing to record all of the traffic; that is, only the data of interest, namely the protocol data that fits the described model, is recorded. The remaining data is discarded.
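For illustration, a minimal DBSCAN over a precomputed distance matrix may be written as below. This is a from-scratch sketch of the published algorithm, not the cluster processor's implementation; the parameters `eps` and `min_pts` (which counts the point itself) are assumed inputs whose values would need to be chosen for the data.

```python
def dbscan(dist, eps, min_pts):
    """Cluster items given a square distance matrix; label -1 marks noise."""
    n = len(dist)
    labels = [None] * n                  # None means "not yet visited"
    cluster = -1

    def neighbours(i):
        return [j for j in range(n) if dist[i][j] <= eps]

    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1               # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster      # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point, so keep expanding
    return labels
```

Sessions labelled -1 belong to no cluster, which suits the present context where many sampled sessions carry no shared protocol structure.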

The above described embodiments relate to the identification of sessions which relate to the same communication protocols. The next stage focuses on whether the information associated with a cluster of sessions (abstract representation of a protocol) can be used to identify templates for the extraction of all instances of an entity from a protocol of interest. A template is defined that describes the expected use case of an entity (e.g. a user's identifier) within communications data. As described above, a triple defined by fore-string; entity; aft-string describes the entity and the surrounding protocol structure. This triple can be used to define a template having the form:

-   -   PATTERN ENTITY PATTERN

The purpose of the following embodiments is to automatically work out the format of this template given the session vector discovered above, and to do this in an unsupervised manner. Once a template has been established it will subsequently be used to extract every instance of an ENTITY from an arbitrary data stream. Here the ENTITY has the same definition as it did for the above embodiments.

The PATTERN parts shown above are the fore-string and aft-string described previously. The PATTERN part may consist of a mixture of fixed and changing components. For example, the patterns:

  From_123456; and From_743

both have the characters ‘From_’ in common. The characters 123456 and 743 are dissimilar. The fact that we have already decomposed the fore-string and aft-strings into ngrams essentially allows the constant parts to be identified. Once the ngram is small enough, only the constant part will remain. For example, when the ngram length reaches 5 then, for the above example, the ngram components are:

  From_, rom_1, om_12, m_123, _1234, 12345, 23456 and From_, rom_7, om_74, m_743

We see here the only common component is ‘From_’. It is the repeated appearance of this ngram that allows the protocol to be detected. If the whole string were used then we would find that the contexts described previously would not cluster together. Similarly if the ngrams were too small they would be indistinguishable from general characters.

In order to successfully extract the ENTITY part of the template the left and right hand edges of the fore-string and aft-string must be identified. In addition, the signatures that strongly define a protocol may not be the same as the signatures that define the content of interest. For example, the signature ‘From_’ may occur in many protocols and hence will be discarded by the first embodiments as it occurs in many sessions. However, the signature ‘From_’ could represent the sender of a message and is consequently of interest. Moreover in order to find the ‘From_’ part of the signature, we must know which bit of it is common to all instances as well as the parts of the signature that vary from instance to instance. This latter step allows the variable bits to be ignored. However, we do need to know where the variable bit finishes in order to distinguish it from the ENTITY part.
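The effect of ngram length on the two example patterns can be checked with a short sketch (the helper names are hypothetical):

```python
def ngrams(text, n):
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def common_ngrams(a, b, n):
    """Ngrams of length n that appear in both patterns."""
    return set(ngrams(a, n)) & set(ngrams(b, n))


shared_5 = common_ngrams("From_123456", "From_743", 5)   # only the constant part
shared_8 = common_ngrams("From_123456", "From_743", 8)   # too long: nothing shared
shared_1 = common_ngrams("From_123456", "From_743", 1)   # too short: stray digits match
```

With n=5 only ‘From_’ is shared; with n=8 nothing is shared, so the contexts would not cluster; with n=1 incidental characters such as the digits 3 and 4 also match.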

In terms of the aft-string it is only necessary to identify a single character as it is simply used as a means to terminate the template. Thus the template can be slightly modified as:

-   -   PATTERN ENTITY TERMINAL_CHARACTER

A single record consisting of: “PATTERN TERMINAL_CHARACTER” can then be composed. The method and apparatus for establishing templates will now be described.

FIG. 7 is a flow diagram showing the method of this embodiment. The ngrams from all sessions within a session cluster are extracted using the centroid vector for that cluster (block 300). The ngrams are then used to extract packets or sessions including those ngrams from the data stream (block 301). The extracted records are then clustered (block 302). The records in a particular record cluster can then be used to determine templates for extraction of additional records (block 303). Finally, the templates are used to configure a sensor (block 304). Further details of each of these steps will be provided below.

The node 117 also includes a sensor configuration system 400. The system 400 is shown in FIG. 8. The system 400 includes the various components for carrying out the method. It will be appreciated that in practice, some of these components may be combined, or alternatively, that the functionality of some components is provided by two or more further components. It will also be appreciated that the components may be provided in hardware or software, the actual implementation not being relevant to the function of the method. FIG. 9 is a flow diagram showing the operation of the system 400.

The configuration system 400 includes an ngram extractor 401. The ngram extractor 401 extracts all ngrams from all sessions in a particular session cluster (block 501). This is done using the centroid vector of that session cluster. Accordingly, the system 400 generates a collection of all ngrams which appear in the contexts of the sessions from a particular cluster.

The extracted ngrams are then used to extract new sessions from the raw data stream flowing through node 117. The system 400 includes a packet extractor 402. The extractor 402 is configured to conduct a string search of the raw data (block 502) for any of the ngrams identified above. The extractor 402 is programmed to extract any packet or session associated with a packet which includes one of the ngrams. The extractor 402 checks each hit within each packet to see if an entity is within 128 bytes of the located ngram (block 503). If so, the packet is kept and the associated session is captured. If not, the packet is discarded. Accordingly, a collection of packets/sessions is established, each of which has at least one ngram within 128 bytes of an entity. As an alternative to searching the raw data stream, the data extracted in the first embodiment can be searched instead. Similarly, data could also just be randomly sampled using the same techniques used in the first phase. The processing described above can then be applied to the captured data.

The system 400 also includes a pattern generator 403. The pattern generator 403 is arranged to formulate a pattern record from each of the ngrams hit within a session (block 504). Each of the above-noted ngrams is followed by an entity which in turn is followed by a string of characters. A pattern record is generated by taking the 128 bytes that precede the entity (called the PATTERN) and a single byte following the entity (called the TERMINAL STRING). Accordingly, a collection of pattern records having the format PATTERN+TERMINAL STRING is generated.
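The proximity check of block 503 may be sketched as follows. The function names are hypothetical, and the sketch adopts one reading of "within 128 bytes": the entity must lie wholly inside a window extending 128 bytes either side of the located ngram.

```python
def find_all(data, needle):
    """Yield the start offset of every (possibly overlapping) occurrence."""
    pos = data.find(needle)
    while pos != -1:
        yield pos
        pos = data.find(needle, pos + 1)


def has_ngram_near_entity(payload, ngrams, entity, window=128):
    """True if any ngram hit has the entity within `window` bytes of it."""
    for gram in ngrams:
        for start in find_all(payload, gram):
            lo = max(0, start - window)
            hi = start + len(gram) + window
            if entity in payload[lo:hi]:
                return True              # keep the packet, capture its session
    return False                         # no qualifying hit: discard the packet
```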

The configuration system 400 also includes a record cluster processor 404. The record cluster processor 404 selects two records and matches them using the Needleman-Wunsch algorithm (block 505). This algorithm aligns two strings of characters using a similarity matrix. Accordingly, the pattern records are aligned with respect to similar groups of characters. For example, take the following four records (and entities):

  123From_457:another@hotmail.com; 124From_458:another@gmail.com; 125From_459:another@gmail.com; 126From_460:another@gmail.com;

The algorithm would align the records so that the common characters “From_” are aligned. Effectively, the algorithm identifies where the records are similar and where they are different. This is applied to all pairs of records which have been extracted from a session.

The record cluster processor 404 then applies the output of the Needleman-Wunsch algorithm to a similarity measure (block 506). For example, a cosine-like similarity measure may be used.
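A textbook form of the Needleman-Wunsch algorithm is sketched below for illustration. The scores (match +1, mismatch −1, gap −1) are assumed values standing in for the similarity matrix, and ‘-’ is used as the gap symbol on the assumption that it does not occur in the records.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align a and b; return (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]
```

For records of unequal length the gap symbols mark where one record has characters that the other lacks.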

However, a problem with the standard cosine measure is that it discards the information associated with the sequence of the characters within a record. For example, the string abcdabcd can be represented as the vector 2i+2j+2k+2l (a→i, b→j, c→k, d→l). However the information that b follows a and c follows b has been lost. In the current case the order of the characters as well as their value is important. In addition, the standard cosine approach doesn't naturally handle misaligned sections of data. Vector components that are not shared by the vectors are ignored when a dot product is formed. Consequently, an alternative distance measure is used. Notionally, this measure constructs a right-angled triangle with sides having length x and y on either side of the right-angle. Regions where the two records are the same contribute to an increase in the length of side x and regions where the two records are different contribute to an increase in the length of side y. The angle which represents the similarity between the two records can then be identified by tan⁻¹(y/x).

The operation of this function is also weighted to prevent unwanted skews in the distance measure. In particular:
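In its simplest, unweighted form, the triangle construction may be sketched as below. The sketch assumes two already-aligned records of equal length, with each aligned position contributing one unit to x or to y; the function name is hypothetical.

```python
import math


def alignment_angle(aligned_a, aligned_b):
    """Angle in radians: 0 for identical records, pi/2 for wholly different."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("records must be aligned to the same length")
    # x grows where the records agree, y where they differ (gaps always differ).
    x = sum(1 for p, q in zip(aligned_a, aligned_b) if p == q)
    y = len(aligned_a) - x
    # atan2 equals tan^-1(y/x) for x > 0 and avoids division by zero at x == 0.
    return math.atan2(y, x)
```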

-   -   For runs of matching characters the x axis is not increased
        indefinitely; the x axis extension produced falls off
        exponentially for each additional character within the run. This
        prevents long runs of positively aligned characters from
        dominating the distance measure.
    -   For sequences that are mismatching there are a couple of
        possibilities:
        -   Wildcard matches can contribute to the x axis extension. For
            example, a wildcard numeric will match any number, but not
            as strongly as an exact match: 8 matches 8 exactly, but 9 is
            still a wildcard numeric match, so an alignment such as this
            still extends the x axis; and
        -   Where there is a run of mismatches/partial matches the
            approach will calculate what the highest extension score is
            for the whole run. This will then be used to extend the y
            axis for the run of mismatching characters. Thus, the
            extension of the y axis for a run of characters is capped.
    -   The sequence information is essentially provided by a
        combination of the extension calculations and the alignment
        provided by the Needleman-Wunsch algorithm:
        -   If a number of character runs are aligned successfully then
            the contribution to the x axis extension will be higher; and
        -   If the number of character runs is low and the alignment is
            bad this will lead to a higher contribution to the y axis
            extension.
    -   Thus, the character sequencing will become evident through the
        angle between the candidate records.

The output of this part of the process is data concerning the similarityof all the aligned records with respect to each other.

The record cluster processor 404 then applies a cluster algorithm to the similarity data produced by the similarity measure (block 507). The aim of this process is to identify common sections of the records which can be used to derive sensor configuration patterns. Accordingly, fairly “compact” clusters are required. It has been found that a “k-means-like” algorithm gives good results. It can then be assumed, with a high degree of certainty, that each record cluster includes records having the same protocol structure. The four records noted above may be an example of this.
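As a rough illustration of the alternating assign-and-update loop of a k-means-like algorithm, the following sketch clusters records from a precomputed matrix of pairwise angles. It is a simplification: each cluster centre here is an existing record (a medoid), which is a substitute for the wild-carded centre used in the described method. The names `assign` and `cluster_records`, the seed indices and the iteration limit are all assumptions made for the example.

```python
def assign(dist, centres):
    """Label each record with the position of its nearest centre."""
    return [min(range(len(centres)), key=lambda c: dist[i][centres[c]])
            for i in range(len(dist))]


def cluster_records(dist, seeds, max_iter=20):
    """k-means-like clustering over a symmetric distance matrix.

    dist[i][j] is the angle between records i and j; seeds holds the
    indices of the records used as initial centres (an assumption).
    """
    centres = list(seeds)
    for _ in range(max_iter):
        labels = assign(dist, centres)
        updated = []
        for c, old in enumerate(centres):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:            # keep the centre of an empty cluster
                updated.append(old)
                continue
            # New centre: the member closest to all other members.
            updated.append(min(members,
                               key=lambda m: sum(dist[m][j] for j in members)))
        if updated == centres:         # converged
            break
        centres = updated
    return assign(dist, centres), centres
```

For instance, four records whose pairwise angles form two tight pairs end up in two clusters even when both seeds are taken from the same pair.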

In order to use a k-means-like algorithm, a representation of a cluster is required that is compatible with an individual record. To meet this requirement, a cluster is represented as a wild-carded record. This is just like a regular record, except that some of the characters are replaced by “wild cards” that can represent either single instances or extended sequences of numeric, alphabetic, or arbitrary characters. Use of this representation has required a small extension to the usual Needleman-Wunsch algorithm so that it can operate with the wild-carded records. However, once two records are matched, it does become fairly clear how to construct an appropriate wild-carded record: where the two individual records match, the common text is simply selected. Where there is a difference, the nature of the difference determines the kind of “wild card” that is substituted.

The Needleman-Wunsch algorithm has been extended so that the class of items in the strings has expanded. Instead of being restricted to literal characters, the class of items now includes a number of wild cards or character classes, such as <digit> (numbers), <space> (whitespace), <alphanumeric> (letters or numbers), etc. The comparison weight function is extended to handle the wild cards so that, for example, matching a literal ‘1’ with <digit> gives a reasonable match weight; matching <digit> and <space> gives a mismatch. The insert cost function is modified slightly to favour extending wildcards so that it's good to insert a digit immediately next to a match against <digit>, for example.
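A minimal sketch of such an extended comparison weight, plugged into the standard Needleman-Wunsch scoring recurrence, might look as follows. Records are assumed to be lists of tokens, each token being either a single literal character or a class name such as `"<digit>"`. The numeric weights (2 for an exact match, 1 for a literal matching a class, -1 for a mismatch or a gap) are illustrative assumptions, two different classes are simply treated as a mismatch, and the modified insert cost that favours extending wildcards is not modelled.

```python
# Character classes and the test a literal must pass to match them.
CLASSES = {
    "<digit>": str.isdigit,
    "<space>": str.isspace,
    "<alphanumeric>": str.isalnum,
}


def weight(a, b):
    """Comparison weight for two tokens (literals or class names)."""
    if a == b:
        return 2                      # exact match
    for cls, lit in ((a, b), (b, a)):
        if cls in CLASSES and lit not in CLASSES and CLASSES[cls](lit):
            return 1                  # literal matched by a wild card
    return -1                         # mismatch, e.g. <digit> vs <space>


def align_score(s, t, gap=-1):
    """Needleman-Wunsch global alignment score of token lists s and t."""
    rows, cols = len(s) + 1, len(t) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + weight(s[i - 1], t[j - 1]),
                score[i - 1][j] + gap,    # gap in t
                score[i][j - 1] + gap,    # gap in s
            )
    return score[-1][-1]
```

With these weights, aligning the literal record fred1a against the wild-carded record fred<digit>a scores slightly below a perfect literal match but well above an alignment containing a mismatch, which is the behaviour the extended weight function is intended to produce.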

Once the best alignment has been found, the whole is encoded as a new wild-carded string (if this is required, e.g. to follow a cluster centre). New or modified wildcards are added where the two sequences do not align perfectly. Simple examples include:

  food match ford -> fo<alphanumeric>d
  freda match fred1a -> fred<digit>a
  fred2a match fred<digit>a -> fred<digit>a
  fred<digit>a match fren2a -> fre<alphanumeric>a
  fre<alphanumeric>a match fo<alphanumeric>d -> f<alphanumeric>
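The encoding step illustrated by these examples can be sketched as below. The sketch assumes the two records have already been aligned into token lists of equal length, with `GAP` marking an inserted position. The generalisation rule (digits generalise to <digit>, anything else to <alphanumeric>) and the rule that adjacent wild cards collapse into a single covering wild card are inferences drawn from the examples, not rules stated in the text, and the function names are illustrative.

```python
GAP = None                                  # marks an alignment gap
WILDCARDS = {"<digit>", "<alphanumeric>"}   # classes used in this sketch


def is_digitish(tok):
    """True for a literal digit or the <digit> wild card."""
    return tok == "<digit>" or (len(tok) == 1 and tok.isdigit())


def generalise(a, b):
    """Token that covers both aligned tokens a and b."""
    if a == b:
        return a
    present = [t for t in (a, b) if t is not GAP]
    if all(is_digitish(t) for t in present):
        return "<digit>"
    return "<alphanumeric>"


def merge(aligned_a, aligned_b):
    """Encode two aligned token lists as one wild-carded record."""
    out = []
    for a, b in zip(aligned_a, aligned_b):
        tok = generalise(a, b)
        if out and out[-1] in WILDCARDS and tok in WILDCARDS:
            # Assumed rule: adjacent wild cards collapse into one.
            out[-1] = tok if tok == out[-1] else "<alphanumeric>"
        else:
            out.append(tok)
    return "".join(out)
```

For example, `merge(list("food"), list("ford"))` yields `fo<alphanumeric>d`, and merging `fred<digit>a` with `fren2a` yields `fre<alphanumeric>a` because the adjacent <alphanumeric> and <digit> wild cards collapse.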

So, if you'd decided that food, ford, freda, fred1a, fred2a and fren2a were all in the same cluster, you'd get the cluster centre f<alphanumeric>. At some point, the character counts are restored so that it's known there are between 3 and 5 characters in the match against <alphanumeric>; the appropriate regular expression is then easily formed as f<3-5 alphanumeric>.
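One way to turn a counted wild-carded record such as f<3-5 alphanumeric> into a conventional regular expression is sketched below. The mapping from class names to character sets, the helper name `to_regex`, and the treatment of an uncounted wild card as a single character are assumptions made for the illustration.

```python
import re

# Assumed mapping from wild-card classes to regular-expression atoms.
CLASS_PATTERNS = {
    "digit": r"[0-9]",
    "space": r"\s",
    "alphanumeric": r"[A-Za-z0-9]",
}

# Matches <digit>, <3-5 alphanumeric>, and similar tokens.
TOKEN = re.compile(r"<(?:(\d+)-(\d+) )?(digit|space|alphanumeric)>")


def to_regex(record):
    """Translate a wild-carded record into a regular expression string."""
    parts, pos = [], 0
    for m in TOKEN.finditer(record):
        parts.append(re.escape(record[pos:m.start()]))   # literal text
        low, high, cls = m.groups()
        quantifier = "{%s,%s}" % (low, high) if low else ""
        parts.append(CLASS_PATTERNS[cls] + quantifier)
        pos = m.end()
    parts.append(re.escape(record[pos:]))
    return "".join(parts)
```

Under these assumptions `to_regex("f<3-5 alphanumeric>")` gives `f[A-Za-z0-9]{3,5}`, which matches all six records in the example cluster.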

This representation of a cluster also helps in construction of the associated sensor configuration pattern (block 508). The wild-carded record corresponds naturally to a regular expression that can be used to match the text that surrounds the occurrence of an entity. The configuration system 400 also includes a template generator 405. The template generator 405 generates sensor configuration templates, based on the clustered records (block 509). The sensor configuration pattern consists of this expression combined with an additional expression to match and output the entity itself. For example, a cluster containing the above-noted contexts may have a representation such as:

-   -   xxxFrom_xxx:entity;

This is then used to program a sensor 406 to extract all data containing this structure. This data may then be stored for further analysis.
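The combination of a context expression with an additional expression that matches and outputs the entity can be sketched with a named capture group. Everything concrete in the sketch is hypothetical: the function names, the group name `entity`, and the sample context and data used in the usage note are invented for illustration and do not reflect the actual template format used to program sensor 406.

```python
import re


def build_sensor_pattern(prefix_regex, entity_regex, suffix_regex=""):
    """Combine context expressions with a capturing entity expression."""
    return re.compile(
        f"{prefix_regex}(?P<entity>{entity_regex}){suffix_regex}"
    )


def extract_entities(pattern, data):
    """Return every entity whose surrounding context matches."""
    return [m.group("entity") for m in pattern.finditer(data)]
```

As a usage example with made-up data, `build_sensor_pattern(r"From_[a-z]{3}:", r"[^;]+", r";")` applied to the string `hdrFrom_abc:alice;xyzFrom_def:bob;` extracts `alice` and `bob`.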

Features of the present invention are defined in the appended claims. While particular combinations of features have been presented in the claims, it will be appreciated that other combinations, such as those provided above, may be used.

The above embodiments describe one way of implementing the present invention. It will be appreciated that modifications of the features of the above embodiments are possible within the scope of the independent claims.

1. A method of configuring a sensor to extract data from a communication data stream, using a group of communication sessions representing a particular communications protocol, the group comprising data structures representative of that protocol, the method comprising: generating a plurality of records representing said data structures, each record having a particular pattern; grouping said records based on the similarity of said patterns, such that each group includes records having the same pattern; generating a template based on the pattern of each group; and configuring said sensor using said template.
 2. A method according to claim 1, wherein, each record includes an entity and the context of that entity.
 3. A method according to claim 1, further comprising: extracting said data structures from a group of communication sessions, prior to generating said records; searching for occurrences of those structures in a data stream; extracting packets containing those occurrences; and generating said records on the basis of the extracted packets.
 4. A method according to claim 3, further comprising aligning the records which have been generated.
 5. A method according to claim 4, wherein the alignment is performed using the Needleman-Wunsch algorithm.
 6. A method according to claim 4, further comprising determining the similarity of the records after aligning the records.
 7. A method according to claim 6, wherein the step of determining similarity is performed using a similarity measure.
 8. A method according to claim 7, wherein the similarity measure is a cosine-like similarity measure. compare each session using a similarity measure; and cluster the sessions based on vector similarity.